AITopics | universal approximator

Your Transformer May Not be as Powerful as You Expect

Neural Information Processing SystemsApr-24-2026, 22:26:22 GMT

Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of the RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative--RPE-based Transformers are universal function approximators. However, we present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Efficient Equivariant Transfer Learning from Pretrained Models

Neural Information Processing SystemsApr-24-2026, 18:48:58 GMT

Efficient transfer learning algorithms are key to the success of foundation models on diverse downstream tasks even with limited data. Recent works of Basu et al. (2023) and Kaba et al. (2022) propose group averaging (equitune) and optimizationbased methods, respectively, over features from group-transformed inputs to obtain equivariant outputs from non-equivariant neural networks. While Kaba et al. (2022) are only concerned with training from scratch, we find that equitune performs poorly on equivariant zero-shot tasks despite good finetuning results. We hypothesize that this is because pretrained models provide better quality features for certain transformations than others and simply averaging them is deleterious. Hence, we propose λ-equitune that averages the features using importance weights, λs. These weights are learned directly from the data using a small neural network, leading to excellent zero-shot and finetuned results that outperform equitune. Further, we prove that λ-equitune is equivariant and a universal approximator of equivariant functions. Additionally, we show that the method of Kaba et al. (2022) used with appropriate loss functions, which we call equizero, also gives excellent zero-shot and finetuned performance.

Add feedback

The Expressive Power of Neural Networks: A View from the Width

Neural Information Processing SystemsMar-17-2026, 13:30:49 GMT

The expressive power of neural networks is important for understanding deep learning. Most existing works consider this problem from the view of the depth of a network. In this paper, we study how width affects the expressiveness of neural networks. Classical results state that depth-bounded (e.g.

artificial intelligence, machine learning, proceedings, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Constrained Reinforcement Learning Has Zero Duality Gap

Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, Alejandro Ribeiro

Neural Information Processing SystemsMar-13-2026, 08:55:51 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, dual problem, parametrization, (14 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Pennsylvania (0.04)
North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Industry: Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Dense Neural Networks are not Universal Approximators

Rauchwerger, Levi, Jegelka, Stefanie, Levie, Ron

arXiv.org Machine LearningFeb-12-2026

We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.

artificial intelligence, machine learning, neural network, (16 more...)

arXiv.org Machine Learning

2602.07618

Country:

North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Austria (0.04)
Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Inverse M-Kernels for Linear Universal Approximators of Non-Negative Functions

Neural Information Processing SystemsFeb-11-2026, 03:36:13 GMT

Kernel methods are widely utilized in machine learning field to learn, from training data, a latent function in a reproducing kernel Hilbert space. It is well known that the approximator thus obtained usually achieves a linear representation, which brings various computational benefits, while maintaining great representation power (i.e., universal approximation). However, when non-negativity constraints are imposed on the function's outputs, the literature usually takes the kernel method-based approximators as offering linear representations at the expense of limited model flexibility or good representation power by allowing for their nonlinear forms. The main contribution of this paper is to derive a sufficient condition for a positive definite kernel so that it may construct flexible and linear approximators of non-negative functions. We call a kernel function that offers these attributes an inverse M-kernel; it is a generalization of the inverse M-matrix. Furthermore, we show that for a one-dimensional input space, universal exponential/Abel kernels are inverse M-kernels and construct linear universal approxima-tors of non-negative functions. To the best of our knowledge, it is the first time that the existence of linear universal approximators of non-negative functions has been elucidated. We confirm the effectiveness of our results by experiments on the problems of non-negativity-constrained regression, density estimation, and intensity estimation. Finally, we discuss issues and perspectives on multi-dimensional input settings.

artificial intelligence, inverse m-kernel, machine learning, (17 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
South America > Chile (0.04)
North America > United States > Virginia (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.88)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Kernel Methods (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

SupplementaryMaterialof TowardsScale-InvariantGraph-related ProblemSolvingbyIterativeHomogeneous GraphNeuralNetworks

Neural Information Processing SystemsFeb-9-2026, 23:13:30 GMT

Tomakesure functions represented by neural networks are homogeneous, we also prove that HomoGNN and HomoMLP can only represent homogeneous functions, in Section B.2.2.

artificial intelligence, machine learning, module, (19 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

a32d7eeaae19821fd9ce317f3ce952a7-Supplemental.pdf

Neural Information Processing SystemsFeb-9-2026, 15:35:49 GMT

artificial intelligence, machine learning, matrix, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.47)

Add feedback

0d02892a0055c94584f6394f8d069c8e-Paper-Conference.pdf

Neural Information Processing SystemsFeb-7-2026, 19:16:07 GMT

equitune, equizero, loss function, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois (0.05)
Asia > Japan > Honshū > Chūbu > Nagano Prefecture > Nagano (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory

Neural Information Processing SystemsDec-27-2025, 03:02:54 GMT

State-space models have gained popularity in sequence modelling due to their simple and efficient network structures. However, the absence of nonlinear activation along the temporal direction limits the model's capacity. In this paper, we prove that stacking state-space models with layer-wise nonlinear activation is sufficient to approximate any continuous sequence-to-sequence relationship. Our findings demonstrate that the addition of layer-wise nonlinear activation enhances the model's capacity to learn complex sequence patterns. Meanwhile, it can be seen both theoretically and empirically that the state-space models do not fundamentally resolve the issue of exponential decaying memory. Theoretical results are justified by numerical verifications.

layer-wise nonlinearity, state-space model, universal approximator, (3 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.62)

Technology: Information Technology > Artificial Intelligence (0.95)

Add feedback